Pandas Exploratory Data Analysis
Data analysis is the process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information and informing conclusions to support decision-making. This involves applying inferential statistical analyses and creating visualizations in order to interpret the results and summarize the main characteristics of the data. The typical information and conclusions extracted from the data include interactions, patterns, and anomalies. These notes rely on the ideas and learnings from the respective package documentation, "Python For Data Analysis: Data Wrangling With Pandas, NumPy, And Jupyter", 3rd Edition, by Wes McKinney (creator and developer of Pandas) in 2022, and "Python Data Science Handbook: Essential Tools For Working With Data", 2nd Edition, by Jake VanderPlas in 2022.
When using Python for data analysis and data science, the most common packages and libraries include NumPy as the numeric library underlying all of the calculations; Pandas as the cornerstone of data manipulation; Matplotlib, Seaborn, and Plotly for intricate visualizations; Statsmodels for advanced statistical functions; SciPy for advanced scientific computing; Scikit-Learn as a toolkit for machine learning; and TensorFlow and PyTorch for artificial intelligence applications. For convenience, Anaconda can be used as a distribution of Python with pre-installed packages which focus on data analysis and data science. In many cases, the vast and open network of packages and libraries available for Python can be leveraged depending on the requirements of projects. It should be kept in mind that the use of Python diverges from traditional tools used for data analysis, such as Microsoft Excel and Tableau, which are primarily visual through point-and-click interfaces and become difficult to use when processing very large sets of data.
The process of data analysis generally follows data extraction, data cleansing, data wrangling, analysis, and resultant action (although the process usually resembles a repetitive cycle rather than a linear path). The data extraction is associated with sourcing data from local or online databases stored in SQL, CSV, XML, JSON, or another file format. The data cleansing is associated with accounting for missing values, empty sets, invalid fields, and other errors. The data wrangling is associated with merging, combining, and joining data to re-arrange and re-shape the data into categories, hierarchies, or indices. The analysis is associated with exploration to extract results from the data, which involves statistical analyses to identify the underlying trends and characteristics. The resultant action is associated with the subsequent recommendations which result from the knowledge gained from the overall process.
Installation And Setup
Pandas (short for "Panel Data" and associated with "Python Data Analysis") provides high-level functionality designed to make using structured data convenient and flexible. This functionality is built on that of NumPy, ..., and ..., and allows for capabilities ranging from intuitive indexing to flexible data manipulation. The vast applicability of Pandas provides for capabilities for reading and writing a variety of file formats and data stores; cleaning, munging, combining, normalizing, reshaping, slicing, and transforming data; applying mathematical and statistical operations and transformations to groups of data to derive new sets of data; connecting data to statistical models, machine learning algorithms, and other computational tools; and creating static or interactive graphical visualizations or textual summaries for presentation.
The prerequisites to install Pandas are NumPy, Dateutil, and Pytz (with more optional dependencies for performance, visualization, computation, and other data sources). Pandas can usually be installed through a package manager, as conventionally performed using Pip, or, alternatively, through the native package manager of a Linux distribution (although this version may be outdated or may not be officially maintained). For advanced developers, Pandas can be built and installed from its source code with control over options for compiling. Once installed, Pandas can be imported into a project.
pip install pandas
pip install --upgrade numpy
conda install pandas
conda update pandas
import pandas
import pandas as pd
...
pandas.options.display.max_rows = 20
pandas.options.display.max_columns = 20
pandas.options.display.max_colwidth = 80
...
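As a brief sketch of the display options above (the option names are the standard display options, and the chosen values are illustrative), the same settings can also be made and queried through the set_option and get_option functions:

```python
import pandas as pd

# Set display options through the function interface (equivalent to
# assigning pandas.options.display.max_rows and the related attributes).
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 80)

# Query the current values back through either interface.
print(pd.get_option("display.max_rows"))
print(pd.options.display.max_colwidth)
```

Both interfaces modify the same underlying options, so mixing them is safe.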
... Creation
With regard to structured data, the most common forms include tabular data (in which each column may be a different type, but each value in a column is the same type), multi-dimensional arrays, multiple tables interrelated by key columns, and evenly or unevenly spaced time series data. In Pandas, structured data can be represented through a series, as a homogeneous and 1-dimensional array with labels for an index, or data frame, as a heterogeneous and column-oriented table with labels for rows and columns (although 2-dimensional, it is possible to represent higher dimensional data in a tabular format using hierarchical indexing). From a high level, a data frame is a programming interface for expressing data manipulations on tabular datasets in a general programming language and whose primary modality is analytical. Compared to SQL-based systems, data frames often use imperative or procedural constructs (emphasis on iterations with a sequence of operations for manipulation), offer access to internal structures, expose operations outside of traditional relational algebra (taking advantage of ordering of records within datasets), and have stateful semantics (...).
A series can be created from a list, dictionary, or array with the index being optionally defined to identify each value with a label. For an alternative perspective, a series can be thought of as a fixed-length and ordered dictionary, as it is a direct mapping of data and index values. A data frame can be created from a list, dictionary, or array with the index and columns being optionally defined to identify each value with labels (using a nested dictionary of dictionaries, the keys of the outer dictionary will be used for the columns and keys of the inner dictionary will be used for the rows). For an alternative perspective, a data frame can be thought of as a dictionary of series which share the same index. It should also be noted that the index of a series or data frame behaves like a fixed-size set (with allowance for duplicate values), where there are also specialized types, such as for monotonic integers, intervals, time, or multi-level objects.
pandas.Series (data = None, index = None, dtype = None, name = None, copy = None, fastpath = False)
pandas.DataFrame (data = None, index = None, columns = None, dtype = None, copy = None)
pandas.Index (data = None, dtype = None, copy = False, name = None, tupleize_cols = True)
pandas.RangeIndex (start = None, stop = None, step = None, dtype = None, copy = False, name = None)
pandas.IntervalIndex (data, closed = None, dtype = None, copy = False, name = None, verify_integrity = True)
pandas.DatetimeIndex (data = None, freq = _NoDefault.no_default, tz = _NoDefault.no_default, normalize = False, closed = None, ambiguous = "raise", dayfirst = False, yearfirst = False, dtype = None, copy = False, name = None)
pandas.TimedeltaIndex (data = None, unit = None, freq = _NoDefault.no_default, closed = None, dtype = None, copy = False, name = None)
pandas.PeriodIndex (data = None, ordinal = None, freq = None, dtype = None, copy = False, name = None, **fields)
pandas.MultiIndex (levels = None, codes = None, sortorder = None, names = None, dtype = None, copy = False, name = None, verify_integrity = True)
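The creation of a series and a data frame described above can be sketched as follows (the labels and values are illustrative):

```python
import pandas as pd

# A series from a dictionary: the keys become the index labels,
# behaving like a fixed-length and ordered dictionary.
s = pd.Series({"a": 1, "b": 2, "c": 3}, name="counts")

# A data frame from a nested dictionary of dictionaries: the keys of
# the outer dictionary become the columns and the keys of the inner
# dictionaries become the row index.
df = pd.DataFrame({"x": {"r1": 1.0, "r2": 2.0},
                   "y": {"r1": 3.0, "r2": 4.0}})

print(list(s.index))     # the series index labels
print(list(df.columns))  # the outer keys
print(list(df.index))    # the inner keys
```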
The information and properties intrinsic to a series, data frame, or index are reflected by the attributes of the series, data frame, or index. For a series, the common attributes include the underlying index, underlying array of data, shape as the size along each dimension, and name assigned to the series. For a data frame, the common attributes include the underlying index for the rows and columns, underlying array of data, and shape as the size along each dimension. For an index, the primary attributes include the labels of the axis and other metadata, such as the shape as the size along each dimension and name assigned to the index (it should be noted that these objects are immutable and cannot be directly modified).
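These attributes, and the immutability of index objects, can be sketched as follows (the labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="demo")
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Common attributes: index, shape, and name.
print(s.shape)   # size along each dimension of the series
print(s.name)    # name assigned to the series
print(df.shape)  # rows and columns of the data frame

# Index objects are immutable: assigning to a label raises TypeError.
try:
    s.index[0] = "z"
except TypeError:
    print("index labels cannot be modified in place")
```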
A series can be indexed to create a slice by a list or tuple of integers (positional indexing), booleans (logical indexing), or labels associated with the index (dictionary-like notation). A data frame can also be indexed to create a slice by a list or tuple of integers (positional indexing), booleans (logical indexing), or labels associated with the index and columns (dictionary-like notation). It is also possible to use dot notation to access a column of a data frame (attribute-like notation, which requires the column label to be a valid variable name and cannot be used to create a new column). However, the preferred way of indexing is to use the supplied methods for positional, logical, or label indexing, as this allows for more consistent behaviour regardless of the data type used and avoids ambiguity. The result of selecting a single row or column in a data frame is a series with an index which contains the column or row labels. Methods are also available to select single scalar values directly within a series or data frame for improved performance with less overhead.
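The preferred indexing methods can be sketched as follows (loc and iloc for label and positional indexing, and at and iat for scalar access; the labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]},
                  index=["a", "b", "c"])

# Label indexing with loc and positional indexing with iloc; selecting
# a single row yields a series indexed by the column labels.
row_by_label = df.loc["b"]
row_by_position = df.iloc[1]
print(row_by_label.equals(row_by_position))

# Logical (boolean) indexing selects the rows where the mask is True.
mask = df["x"] > 1
print(list(df.loc[mask].index))

# Scalar access with less overhead through at (label) and iat (position).
print(df.at["c", "y"], df.iat[2, 1])
```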
A distinction needs to be made between basic indexing and advanced indexing. The primary difference is that basic indexing will only select a slice from an array, while advanced indexing will select an arbitrary group from an array (which allows for repetition of indices). Under basic indexing, a slice of the original array is referenced, where this slice is a view (using the same values in memory) and any modification to the view will be reflected in the original array (a copy needs to be explicitly specified to create a new object). Under advanced indexing, a group from the original array is created, where this group is a copy and acts as a new object. It should be noted that selecting data by boolean indexing and assigning the result will always create a copy of the data. In addition, the search order for indexing is row-major (fill the consecutive elements of a row before moving to subsequent rows).
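The distinction between a view (basic indexing) and a copy (advanced indexing) can be sketched with NumPy, whose semantics underlie the arrays in Pandas (the values are illustrative):

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # row-major layout by default

# Basic indexing: a slice is a view, so modifying the slice
# modifies the original array as well.
view = arr[0, :]
view[0] = 99
print(arr[0, 0])  # the original reflects the change

# Advanced (fancy) indexing: an arbitrary group is a copy, and
# repetition of indices is allowed.
group = arr[[0, 0, 1], [0, 1, 2]]
group[0] = -1
print(arr[0, 0])  # the original is unchanged by modifying the copy
```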
Basic Data Manipulation
Although there are several general functions, most of the manipulation of data is performed through methods. For data alignment, the index of a series or data frame can be modified to re-order existing data and fill locations without values (with a NaN by default). In a similar way, it is possible to set the index using a column of a data frame or reset the index of a series or data frame. Additional values can be inserted into or appended onto an array, while other values can be deleted using the appropriate indices. In addition, it should be noted that a series or data frame often behaves similarly to a ndarray and can often be used as an input to universal functions from NumPy (or converted into an array, where the data type will be chosen to accommodate all of the columns).
series.reindex (index = None, *, axis = None, method = None, copy = None, level = None, fill_value = None, limit = None, tolerance = None)
data_frame.reindex (labels = None, *, index = None, columns = None, axis = None, method = None, copy = None, level = None, fill_value = nan, limit = None, tolerance = None)
series.align (other, join = "outer", axis = None, level = None, copy = None, fill_value = None, method = None, limit = None, fill_axis = 0, broadcast_axis = None)
data_frame.align (other, join = "outer", axis = None, level = None, copy = None, fill_value = None, method = None, limit = None, fill_axis = 0, broadcast_axis = None)
series.rename (index = None, *, axis = None, copy = None, inplace = False, level = None, errors = "ignore")
data_frame.rename (mapper = None, *, index = None, columns = None, axis = None, copy = None, inplace = False, level = None, errors = "ignore")
series.set_axis (labels, *, axis = 0, copy = None)
data_frame.set_axis (labels, *, axis = 0, copy = None)
series.reset_index (level = None, *, drop = False, name = _NoDefault.no_default, inplace = False, allow_duplicates = False)
data_frame.reset_index(level = None, *, drop = False, inplace = False, col_level = 0, col_fill = "", allow_duplicates = _NoDefault.no_default, names = None)
series.drop (labels = None, *, axis = 0, index = None, columns = None, level = None, inplace = False, errors = "raise")
data_frame.drop (labels = None, *, axis = 0, index = None, columns = None, level = None, inplace = False, errors = "raise")
pandas.Index.drop (labels, errors = "raise")
pandas.Index.delete (loc)
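The alignment and index methods above can be sketched as follows (the labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Reindexing re-orders existing data and fills locations without
# values (NaN by default, or a chosen fill value).
r = s.reindex(["c", "a", "d"], fill_value=0.0)
print(list(r.index), r["d"])

df = pd.DataFrame({"key": ["k1", "k2"], "value": [10, 20]})

# Set a column of the data frame as the index, then reset the index
# back into a column.
indexed = df.set_index("key")
restored = indexed.reset_index()
print(list(restored.columns))

# Drop a row by label and a column by name.
dropped = df.drop(index=[0]).drop(columns=["value"])
print(list(dropped.columns))
```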
Calculations And Operations
When performing arithmetic operations, the corresponding values with a common index are added, subtracted, multiplied, or divided, where missing values (or a chosen fill value) will be introduced for the locations which do not have a common index. Thus, the results will have an index which is the union of the indices from each of the parts of the operation (using series or data frames which have no common index will result in only missing values). This is comparable to performing the operations in an element-wise manner as a batch operation. Similarly, logic functions can be used to evaluate an array with the results given as booleans in an element-wise or matrix-wise manner - common logic functions include evaluating whether values are greater than, less than, or equal to a basis. If the arrays are not the same shape, they must be broadcastable to a common shape along their dimensions. It should be noted that, due to this broadcasting, whenever an operation involves an array with a scalar, an element-wise operation will be performed, where the scalar is applied to each element of the array based on the operation.
data_frame_augend.add (data_frame_addend, axis = "columns", level = None, fill_value = None)
data_frame_minuend.sub (data_frame_subtrahend, axis = "columns", level = None, fill_value = None)
data_frame_multiplicand.mul (data_frame_multiplier, axis = "columns", level = None, fill_value = None)
data_frame_dividend.div (data_frame_divisor, axis = "columns", level = None, fill_value = None)
data_frame_dividend.mod (data_frame_divisor, axis = "columns", level = None, fill_value = None)
data_frame_base.pow (data_frame_exponent, axis = "columns", level = None, fill_value = None)
data_frame_addend.radd (data_frame_augend, axis = "columns", level = None, fill_value = None)
data_frame_subtrahend.rsub (data_frame_minuend, axis = "columns", level = None, fill_value = None)
data_frame_multiplier.rmul (data_frame_multiplicand, axis = "columns", level = None, fill_value = None)
data_frame_divisor.rdiv (data_frame_dividend, axis = "columns", level = None, fill_value = None)
data_frame_divisor.rmod (data_frame_dividend, axis = "columns", level = None, fill_value = None)
data_frame_exponent.rpow (data_frame_base, axis = "columns", level = None, fill_value = None)
data_frame_standard.gt (data_frame_basis, axis = "columns", level = None)
data_frame_standard.ge (data_frame_basis, axis = "columns", level = None)
data_frame_standard.lt (data_frame_basis, axis = "columns", level = None)
data_frame_standard.le (data_frame_basis, axis = "columns", level = None)
data_frame_standard.eq (data_frame_basis, axis = "columns", level = None)
data_frame_standard.ne (data_frame_basis, axis = "columns", level = None)
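The alignment behaviour of the arithmetic and comparison methods above can be sketched with series (the labels and values are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Plain addition aligns on the union of the indices; locations which
# do not have a common index become missing values (NaN).
total = a + b
print(list(total.index))
print(pd.isna(total["a"]), total["b"], pd.isna(total["d"]))

# The add method accepts a fill value for the one-sided locations.
filled = a.add(b, fill_value=0)
print(filled["a"], filled["d"])

# Comparison methods evaluate element-wise and return booleans.
print(a.gt(1).tolist())
```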
If there is not a built-in function, it is possible to create a function performing the desired operations and then apply this function along an axis of a data frame or as an element-wise operation. A distinction can also be made between transforming data (keeping a consistent structure) and aggregating data (producing a modified structure). However, in most cases, there will be a suitable built-in function to use for statistics (mean, median, standard deviation, etc), sorting (numerical, alphabetical, ascending, descending, etc), and sets (unique, ..., etc).
data_frame.apply (function_handle, axis = 0, raw = False, result_type = None, args = (), **kwargs)
data_frame.applymap (function_handle, na_action = None, **kwargs)
data_frame.transform (function_handle, axis = 0, *args, **kwargs)
data_frame.aggregate (function_handle = None, axis = 0, *args, **kwargs)
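The distinction between applying, transforming, and aggregating can be sketched as follows (the values and functions are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Apply a custom function along each column (axis = 0) to reduce
# each column to a single value.
spans = df.apply(lambda column: column.max() - column.min(), axis=0)
print(spans["x"], spans["y"])

# Transform keeps a consistent structure while modifying each value.
doubled = df.transform(lambda value: value * 2)
print(doubled.shape == df.shape)

# Aggregate produces a modified structure, possibly applying several
# built-in functions at once.
summary = df.aggregate(["min", "max"])
print(summary.at["max", "y"])
```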
... min max sum mean median
...
...
...
...
...